ABSTRACT

Searches for statistically significant correlations between arrival directions of 
ultra-high energy cosmic rays and classes of astrophysical objects are common in 
astroparticle physics. We present a method to test potential correlation signals of 
a priori unknown strength and evaluate their statistical significance sequentially, 
i.e., after each incoming new event in a running experiment. The method can be 
applied to data taken after the test has concluded, allowing for further monitoring 
of the signal significance. It adheres to the likelihood principle and rigorously 
accounts for our ignorance of the signal strength. 

Subject headings: cosmic rays — methods: statistical 

1. Introduction 

One of the major goals in astroparticle physics is the identification and the study of 
sources of ultra-high energy cosmic rays, defined as cosmic rays with energies larger than 
$10^{18}$ eV. The discovery of discrete sources would answer longstanding questions about how
and where particles are accelerated to such energies. So far, no discrete sources have been 
positively identified. One major obstacle for the identification of potential sources is the 
small number of detected events. Until a few years ago, the published world data set of 
cosmic rays with energies above $4 \times 10^{19}$ eV consisted of little more than 100 events, mainly
recorded with the Akeno Giant Air Shower Array (AGASA) in Japan between 1984 and
2003 (Takeda et al. 1999), and the High Resolution Fly's Eye (HiRes) Experiment in Utah
between 1997 and 2006 (Abbasi et al. 2004).

Nevertheless, the small data set has been subjected to exhaustive searches for deviations
from isotropy. These include searches for point sources; searches for an excess of
clustering in the distribution of arrival directions on various angular scales; and searches for 
correlations with classes of known astrophysical objects that were considered likely sites of 
cosmic ray acceleration. Some of these searches resulted in potential signals, but because 
of the small size of the data set, the statistical significance could not be established in a 
reliable manner. Consequently, while the discovery of discrete sources was claimed repeatedly,
statistically independent data routinely failed to support earlier claims. An example
is the search for correlations of cosmic ray arrival directions with objects of the BL Lac
class (Tinyakov & Tkachev 2001; Gorbunov et al. 2004; Abbasi et al. 2006).



With a new generation of large-aperture astroparticle physics detectors like the Pierre 
Auger Observatory nearing completion in Malargüe, Argentina and the Telescope Array
detector under construction in Utah, the amount of ultra-high energy data is now growing
at an unprecedented pace. The Pierre Auger Observatory, for instance, began scientific data
taking in January 2004 and has already accumulated over $9 \times 10^3$ km$^2$ sr yr of integrated
exposure, more than any previous experiment. 



1.1. Basic Search Techniques in Cosmic Ray Physics 

The fact that previous experiments have failed to find statistically significant deviations 
from isotropy in skymaps of ultra-high energy cosmic rays can be seen as an indication 
that the sources are weak. In this case, the most promising correlation searches are not 
those which aim at finding sources individually, but rather those conducted on a statistical 
basis; i.e., searches for significant correlations of cosmic ray arrival directions with catalogs 
of astrophysical objects. 

When studying correlations with objects from a source catalog, one tests whether the
probability $p$ of a given event to arrive from the direction of an object in the catalog is
significantly larger than the probability $p_0$ of the correlation occurring by chance. These
analyses are typically binned, so an event is said to correlate with an object from the catalog
if the angle between its arrival direction and the object's position is smaller than some angle
$\theta$. If the particles are neutral, $\theta$ could be chosen to reflect the point spread function of
the detector. In the case of cosmic rays, however, the particles are most likely charged
and therefore deflected by Galactic and intergalactic magnetic fields of unknown strength.
Consequently, $\theta$ is usually chosen to be larger than the resolution of the detector to account
for magnetic smearing.

Typically, potential signals are identified after intensive searches using different angular
scales, different energy thresholds, different source catalogs, and other parameters that are 
found to maximize the signal strength. Therefore, an unbiased chance probability for the 
observed signal can only be established by discarding the data set used to find the signal 
and testing the signal with statistically independent data. For the test, the source catalog 
and all analysis parameters are fixed a priori to obtain an unbiased chance probability for 
the signal. 

Once the a priori analysis parameters are identified, the problem is easily formulated in
terms of a classical hypothesis test, in which new data are checked for compatibility with a
null hypothesis $\mathcal{H}_0$ ("the data exhibit no significant correlation") or an alternative "signal"
hypothesis $\mathcal{H}_1$. There are several ways to perform such a test. For example, one can run the
test after the new data set has reached a certain size $n$, or after the experiment has run for
a certain fixed amount of time.

Formally, the size of the data set and the acceptance or rejection of the null hypothesis
are determined by two probabilities, $\alpha$ and $\beta$, which are usually chosen before the start of
the test. These values define the experimenter's tolerance for different sorts of experimental
errors: $\alpha$ is the probability of wrongly rejecting the null hypothesis when $\mathcal{H}_0$ is true (a type-1
or "false positive" error); and $\beta$ is the probability of wrongly accepting the null hypothesis
when $\mathcal{H}_0$ is false (a type-2 or "false negative" error). In a classical one-sided hypothesis test,
where a p-value $P$ is used to estimate the agreement of the data with the null hypothesis,
the result $P < \alpha$ implies rejection of $\mathcal{H}_0$ at the "confidence level" $1 - \alpha$. Meanwhile, the
desired probability of rejecting a false null hypothesis ($1 - \beta$) fixes the required size of the
data set ($n$).
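
As a minimal sketch of such a one-shot test, the one-sided p-value for observing $k$ or more chance correlations among $n$ events can be taken from the binomial tail, assuming each event correlates independently with probability $p_0$. The function name and the example numbers ($k = 6$, $n = 10$, $p_0 = 0.1$, $\alpha = 0.001$) are illustrative assumptions, not values prescribed by the analysis.

```python
from math import comb


def binomial_p_value(k: int, n: int, p0: float) -> float:
    """Chance probability of k or more correlated events out of n,
    if each event correlates independently with probability p0."""
    return sum(comb(n, j) * p0**j * (1.0 - p0) ** (n - j) for j in range(k, n + 1))


# Illustrative numbers only (assumed for this sketch).
alpha = 0.001
p_value = binomial_p_value(k=6, n=10, p0=0.1)
reject_null = p_value < alpha
print(f"P = {p_value:.3e}, reject H0: {reject_null}")
```

The p-value is evaluated exactly once here, after all $n$ events are in hand.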



1.2. One-Shot vs. Sequential Testing 

If one chooses to evaluate P after a predefined number of events has been recorded, or 
a predefined amount of time has elapsed, then the significance of the signal is tested only 
once. However, it is often desirable to evaluate and test the signal sequentially, i.e., after 
each new event, rather than at the end of the test. This approach allows for the possibility of 
claiming a statistically significant result earlier than with methods that check the signal only 
once, a distinct advantage when event rates are quite low. It also avoids another practical 
disadvantage of hypothesis tests that arises when the experiment, for one reason or another, 
has to discontinue data taking before the predefined number of events is taken. In that case, 
the "one-shot" analysis does not lead to a conclusion. 

A sequential analysis can be performed in several ways. If $P$ is evaluated after every
incoming event and not just once after all $n$ events are collected, a "penalty" factor has to be
inserted to account for the fact that there are now more opportunities to satisfy the test by
chance (Anscombe 1954; Armitage et al. 1969). This penalty factor can be evaluated with
simulations and will depend on $n$. The dependence of $P$ on $n$ is an undesirable feature of
the method; rather than depending on the data that were actually recorded, $P$ now depends
on the number of events that an observer would have recorded had he decided to perform
a "one-shot" test. The interpretation of the data therefore depends on data not actually
taken. This feature of the test violates the likelihood principle (Berry 1987).
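
The need for a penalty factor can be illustrated with a small Monte Carlo sketch. Under the null hypothesis, it compares how often the binomial p-value falls below $\alpha$ when checked once after $n$ events with how often it does so at least once when checked after every event. All names and numbers here ($n = 50$, $p_0 = 0.1$, $\alpha = 0.05$, 2000 trials, the seed) are assumptions chosen for illustration.

```python
import random
from math import comb


def tail(k, n, p0):
    """One-sided binomial p-value for k or more correlations in n events."""
    return sum(comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(k, n + 1))


def critical_counts(n_max, p0, alpha):
    """For each sample size m, the smallest k with p-value < alpha
    (m + 1 if no count can reach that significance)."""
    crit = [None]
    for m in range(1, n_max + 1):
        crit.append(next((k for k in range(m + 1) if tail(k, m, p0) < alpha), m + 1))
    return crit


def false_positive_rates(n_max=50, p0=0.1, alpha=0.05, trials=2000, seed=1):
    rng = random.Random(seed)
    crit = critical_counts(n_max, p0, alpha)
    single = repeated = 0
    for _ in range(trials):
        k, crossed = 0, False
        for m in range(1, n_max + 1):
            k += rng.random() < p0                  # null hypothesis is true
            crossed = crossed or k >= crit[m]       # look after every event
        single += k >= crit[n_max]                  # look only once, at the end
        repeated += crossed
    return single / trials, repeated / trials


rate_single, rate_repeated = false_positive_rates()
print(f"one look: {rate_single:.3f}, look after every event: {rate_repeated:.3f}")
```

Because the final look is one of the repeated looks, the repeated-look rate can never be smaller than the single-look rate for the same simulated data; the size of the excess is what the penalty factor has to absorb.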



In addition, the inclusion of the penalty factor means that data arriving after the test 
has ended cannot be used to calculate P for the entire data set. It is therefore not possible 
to include new data in the calculation of the probability. In many practical situations, data 
taking continues after the test has ended, and it is highly desirable to monitor the signal 
probability with new data. 



The classical sequential likelihood ratio test developed by Wald (1945, 1947) avoids the
limitations that arise when using the p-value $P$. Wald defines the likelihood ratio evaluated
after the $n$-th event as

$$ \mathcal{R}_n = \frac{P(\mathcal{D} \mid \mathcal{H}_1)}{P(\mathcal{D} \mid \mathcal{H}_0)} , \qquad (1) $$

where the denominator and numerator represent the probability of observing a data set $\mathcal{D}$
given a null hypothesis (no correlation) and an alternative (correlation). The ratio $\mathcal{R}_n$ can
be evaluated after each incoming event (i.e., after the $n$-th event) without statistical penalty,
and the test stops with the acceptance or rejection of the null hypothesis when $\mathcal{R}_n$ falls below
or exceeds a predefined value (details will be given in Section 2). Moreover, the evaluation
of $\mathcal{R}_n$ can continue after the decision to see whether new data continue to favor or disfavor
the selected hypothesis.

The probabilities $P(\mathcal{D} \mid \mathcal{H}_0)$ and $P(\mathcal{D} \mid \mathcal{H}_1)$ in eq. (1) depend on the expected correlations
in case of random coincidences and true signals, respectively. In correlation studies, the
strength of the signal is typically not known before the test is complete; so in the analysis
proposed by Wald (1945, 1947), one simply takes a "best guess" at the lower bound of
the signal strength. In this paper, we extend Wald's technique to marginalize the signal
strength, which more rigorously accounts for our ignorance of the true signal. As in the
classical likelihood ratio test, this extended test can be applied after each new event without
statistical penalty, so that it adheres to the likelihood principle. It also allows for the
evaluation of the significance of the signal after the test has been fulfilled, as well as in cases
where the test stops prematurely.



We note that the usefulness of this test is not limited to cosmic ray physics. It can be
applied in many other areas of astroparticle physics or astrophysics where event rates are
low, for example in searches for discrete sources of high energy neutrinos or $\gamma$-rays.



2. The Method



We consider the case of an analysis searching for correlations between cosmic ray arrival
directions and objects from a catalog. The background probability $p_0$ is the probability that
a given event correlates by chance. We want to test the signal probability $p_1$ against $p_0$. If
two point hypotheses are tested against each other, $p_0$ and $p_1$ are single numbers; but in
general, $p_1$ can also have a range of values. If, for example, the "signal" corresponds to a
stronger correlation than can be expected by chance, then $p_1 > p_0$.

Since an event can either be correlated with an object from the catalog or not, the
probability of observing a data set $\mathcal{D}$ in which $k$ out of $n$ events correlate with sources is
given by the binomial distribution

$$ P(\mathcal{D} \mid p) = \binom{n}{k} \, p^k (1-p)^{n-k} , \qquad (2) $$

where $p$ is the probability of a given event to correlate. If the data show no significant
correlations in addition to those occurring by chance, then $p = p_0$.

In a sequential analysis that tests hypothesis $\mathcal{H}_1$ against $\mathcal{H}_0$ with data $\mathcal{D}$, the probability
ratio $\mathcal{R}_n$ of eq. (1) is calculated after each incoming event, and is then compared to
two positive constants $A$ and $B$ (where $B < A$). During each step in the sequence, the
experimenter is presented with the following possible outcomes:

1. $\mathcal{R}_n \ge A$: the test terminates with the rejection of $\mathcal{H}_0$.

2. $\mathcal{R}_n \le B$: the test terminates with the acceptance of $\mathcal{H}_0$.

3. $B < \mathcal{R}_n < A$: the test continues to record data.





Wald (1945, 1947) showed that the constants $A$ and $B$ are closely related to the probabilities
$\alpha$ and $\beta$ of type-1 and type-2 errors:

$$ A \le \frac{1-\beta}{\alpha} \quad \text{and} \quad B \ge \frac{\beta}{1-\alpha} . \qquad (3) $$

While it is difficult in most practical situations to estimate exact values for $A$ and $B$, Wald
showed that simply choosing

$$ A = \frac{1-\beta}{\alpha} \quad \text{and} \quad B = \frac{\beta}{1-\alpha} \qquad (4) $$

as the test boundaries leads to adequate results if $\alpha$ and $\beta$ are small (typically, they are not
larger than 0.05). By adequate, we mean that the true type-1 and type-2 rates will never
exceed $\alpha$ and $\beta$. In fact, the true error rates will often be smaller than the nominal $\alpha$ and $\beta$
specified before the start of the experiment.
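
The boundaries of eq. (4) and the three-way decision rule can be sketched in a few lines. The function names and the returned strings are assumptions made for this illustration.

```python
def wald_thresholds(alpha: float, beta: float) -> tuple[float, float]:
    """Test boundaries A and B of eq. (4) for nominal error rates alpha, beta."""
    A = (1.0 - beta) / alpha
    B = beta / (1.0 - alpha)
    return A, B


def decide(ratio: float, A: float, B: float) -> str:
    """Outcome of one step of the sequential test for a given ratio R_n."""
    if ratio >= A:
        return "reject H0"
    if ratio <= B:
        return "accept H0"
    return "continue"


A, B = wald_thresholds(alpha=0.001, beta=0.001)
print(A, B, decide(12.0, A, B))
```

For $\alpha = \beta = 0.001$ this gives $A = 999$ and $B \approx 0.001$, so the ratio has to move by about three orders of magnitude in either direction before the test stops.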

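For a data set that contains $n$ events and $k$ correlations, the likelihood ratio is given by

$$ \mathcal{R}_n = \frac{P(\mathcal{D} \mid p_1)}{P(\mathcal{D} \mid p_0)} = \frac{p_1^k (1-p_1)^{n-k}}{p_0^k (1-p_0)^{n-k}} . \qquad (5) $$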
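
For two point hypotheses the ratio is conveniently accumulated in log space, since each new event simply adds one of two constants. The following is a minimal sketch; the function name and the example values are assumptions.

```python
from math import exp, log


def log_ratio_point(k: int, n: int, p0: float, p1: float) -> float:
    """Logarithm of the point-hypothesis likelihood ratio for k correlations in n events."""
    return k * log(p1 / p0) + (n - k) * log((1.0 - p1) / (1.0 - p0))


# Illustrative values only: 6 correlations in 10 events, p0 = 0.1, p1 = 0.3.
ratio = exp(log_ratio_point(k=6, n=10, p0=0.1, p1=0.3))
print(f"R_n = {ratio:.1f}")
```

A correlated event adds $\ln(p_1/p_0)$ to the running sum and an uncorrelated event adds $\ln[(1-p_1)/(1-p_0)]$, so the ratio can be updated after every event without revisiting earlier data.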



In practice, the signal strength $p_1$ is often not known. We consider here the common case
of a one-sided test where $p_0 < p_1 < 1$. The confidence in rejecting $\mathcal{H}_0$ typically increases with
increasing $p$. To evaluate $\mathcal{R}_n$ in this case, we can expand the numerator and denominator of
eq. (1) in terms of $p$:

$$ \mathcal{R}_n = \frac{\int_0^1 P(\mathcal{D} \mid p) \, P(p \mid \mathcal{H}_1) \, dp}{\int_0^1 P(\mathcal{D} \mid p) \, P(p \mid \mathcal{H}_0) \, dp} . \qquad (6) $$

The quantities $P(p \mid \mathcal{H}_1)$ and $P(p \mid \mathcal{H}_0)$ represent our prior assumptions about $p$ in the
cases of true signal vs. chance correlations. In cosmic ray studies, the probability $p_0$ of a
chance correlation with a catalog object is estimated from the a priori parameters of the
test: e.g., the detector exposure to the catalog, the angular bin size $\theta$, etc. In contrast, it
is fairly uncommon to have a reliable estimate of the signal probability $p_1$ beyond the fact
that $p_1 > p_0$. Absent further knowledge of the signal, we can therefore treat the probability
as uniformly distributed on the interval $[p_1, 1]$. Hence, we summarize our prior knowledge of
the two cases by

$$ P(p \mid \mathcal{H}_1) = \frac{1}{1-p_1} \quad \text{for } p_1 \le p \le 1 \text{ (and 0 otherwise)} , \qquad (7) $$

$$ P(p \mid \mathcal{H}_0) = \delta(p - p_0) . \qquad (8) $$

Note that $p$ is not time-dependent, although we do not see anything inherently problematic
in inserting a time-dependence. Although not many ultra-high energy cosmic ray models
propose a time-dependence, if a time-dependent model is inserted for $\mathcal{H}_0$, the probability of
each successive event is evaluated based on what is expected at the time it was measured.
However, if $\mathcal{H}_0$ and $\mathcal{H}_1$ are simply wrong, that is, if the hypotheses do not properly reflect what
could happen in nature, then any result is possible. This hazard exists for any hypothesis
test.

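Solving for the likelihood ratio $\mathcal{R}_n$, we have

$$ \mathcal{R}_n = \frac{\int_{p_1}^{1} p^k (1-p)^{n-k} \, dp}{p_0^k (1-p_0)^{n-k} \, (1-p_1)} \qquad (9) $$

$$ \mathcal{R}_n = \frac{B(k+1, n-k+1) - B(p_1; k+1, n-k+1)}{p_0^k (1-p_0)^{n-k} \, (1-p_1)} , \qquad (10) $$

where $B(a, b)$ and $B(x; a, b)$ are the complete and incomplete beta functions. Note that
eq. (10) is a convenient form for the numerical computation of $\mathcal{R}_n$.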
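
One way to evaluate eq. (10) without a special-function library is sketched below. It relies on two standard identities for integer $k$ and $n$: $B(k+1, n-k+1) = 1/[(n+1)\binom{n}{k}]$, and the regularized incomplete beta function $I_x(k+1, n-k+1)$ equals the probability that a binomial variable with $n+1$ trials and success probability $x$ is at least $k+1$. The function name is an assumption; this is an illustration, not the authors' implementation.

```python
from math import comb


def marginalized_ratio(k: int, n: int, p0: float, p1: float) -> float:
    """Marginalized likelihood ratio of eq. (10) for k correlations in n events."""
    # Complete beta function B(k+1, n-k+1) for integer arguments.
    complete_beta = 1.0 / ((n + 1) * comb(n, k))
    # Fraction of the beta integral above p1: P(X <= k) for X ~ Binomial(n+1, p1).
    upper_fraction = sum(
        comb(n + 1, j) * p1**j * (1.0 - p1) ** (n + 1 - j) for j in range(k + 1)
    )
    numerator = complete_beta * upper_fraction
    denominator = p0**k * (1.0 - p0) ** (n - k) * (1.0 - p1)
    return numerator / denominator


print(marginalized_ratio(k=1, n=1, p0=0.1, p1=0.3))
```

For a single correlated event ($n = k = 1$) the integral can be done by hand and gives $(1 + p_1)/(2 p_0)$, which is a convenient check of the numerical form. For very long data sets the powers in the denominator can underflow, in which case a log-space variant would be preferable.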



When nothing is known a priori about the strength of the signal, $p_1$ will be chosen close
to $p_0$ to test as large a signal space $p$ as possible. If more information on $p$ were available
(for example, if it were known that $p$ is larger than some value $p_{\min}$), then the range of
integration could be made smaller. To illustrate the merits of improved knowledge, Fig. 1
shows $\mathcal{R}_n$ as a function of $p_1$ for $n = 10$, $k = 6$, and $p_0 = 0.1$. Since the "true" probability
for an event to correlate is $p = 6/10 = 0.6$, choosing $p_1$ close to $p$ increases $\mathcal{R}_n$ and therefore
minimizes the time necessary to confirm the signal. As $p_1$ continues to increase beyond the
true signal probability, $\mathcal{R}_n$ decreases, as expected.
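
The dependence on the lower integration bound can be explored with a short scan for the same numbers ($n = 10$, $k = 6$, $p_0 = 0.1$). This is a sketch under the assumption that the ratio is evaluated with the binomial-sum form of the incomplete beta function; the grid spacing of 0.1 and all names are illustrative choices.

```python
from math import comb


def marginalized_ratio(k, n, p0, p1):
    """Marginalized likelihood ratio for k correlations in n events."""
    cdf = sum(comb(n + 1, j) * p1**j * (1 - p1) ** (n + 1 - j) for j in range(k + 1))
    return cdf / ((n + 1) * comb(n, k) * p0**k * (1 - p0) ** (n - k) * (1 - p1))


n, k, p0 = 10, 6, 0.1
# Coarse grid of lower integration bounds p1 = 0.1, 0.2, ..., 0.9.
scan = {j / 10: marginalized_ratio(k, n, p0, j / 10) for j in range(1, 10)}
best_p1 = max(scan, key=scan.get)

for p1, ratio in scan.items():
    print(f"p1 = {p1:.1f}  R_n = {ratio:8.1f}")
print("largest ratio on this grid at p1 =", best_p1)
```

On this coarse grid the ratio rises as $p_1$ moves away from $p_0$, is largest somewhat below the observed fraction $k/n$, and drops steeply once $p_1$ exceeds it.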

Fig. 2 shows the results of the sequential analysis described above when applied to
simulated data sets. The background probability is $p_0 = 0.1$; $p_1 = 0.3$ is the minimum signal
we choose to distinguish from the background; and $\alpha = \beta = 0.001$. The upper plot shows
the result of the test for data sets with a correlation probability of $p = 0.5$ ($\mathcal{H}_0$ is false),
whereas for the bottom plot, $p = 0.1$ ($\mathcal{H}_0$ is true). For both plots, the analysis is performed
for 10^ Monte Carlo data sets, and the dark and light grey areas indicate the range that 
includes 68% and 95% of the data sets. 
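
A single simulated data set can be run through the sequential test as sketched below, recording the ratio after every event. The function names, the cap `n_max`, and the seed are assumptions for this illustration; the parameters $p_0 = 0.1$, $p_1 = 0.3$ and $\alpha = \beta = 0.001$ follow the example in the text.

```python
import random
from math import comb


def marginalized_ratio(k, n, p0, p1):
    """Marginalized likelihood ratio for k correlations in n events."""
    cdf = sum(comb(n + 1, j) * p1**j * (1 - p1) ** (n + 1 - j) for j in range(k + 1))
    return cdf / ((n + 1) * comb(n, k) * p0**k * (1 - p0) ** (n - k) * (1 - p1))


def run_sequential_test(p_true, p0=0.1, p1=0.3, alpha=0.001, beta=0.001,
                        n_max=1000, seed=0):
    """Simulate events with correlation probability p_true until the test stops."""
    rng = random.Random(seed)
    A = (1 - beta) / alpha
    B = beta / (1 - alpha)
    k, history = 0, []
    for n in range(1, n_max + 1):
        k += rng.random() < p_true          # does this event correlate?
        ratio = marginalized_ratio(k, n, p0, p1)
        history.append(ratio)
        if ratio >= A:
            return "reject H0", n, history
        if ratio <= B:
            return "accept H0", n, history
    return "undecided", n_max, history


decision, n_events, history = run_sequential_test(p_true=0.5)
print(decision, "after", n_events, "events")
```

Two limiting cases are easy to follow by hand: if every event correlates, the ratio exceeds $A = 999$ after 4 events, and if no event correlates, it drops below $B$ after 17 events.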



3. The Ratio of Likelihoods, the Ratio of Posteriors, and the Meaning of $\alpha$ and $\beta$



Here, $\mathcal{R}_n$ is defined as a ratio of likelihoods, but one could just as easily define $\mathcal{R}_n$ as a
ratio of posterior probabilities as suggested by Wald (1945, 1947). However, changing the
definition of $\mathcal{R}_n$ carries consequences in the interpretation of $\alpha$ and $\beta$. To understand how,
we first review what $\alpha$ and $\beta$ mean in the context of the likelihood ratio.

The meaning of the probabilities in the numerator and denominator of $\mathcal{R}_n$ is obviously
connected to the meaning of $\alpha$ and $\beta$. One could argue that, since we are marginalizing
parameters anyway, we might as well calculate the posterior probabilities as suggested in
Wald's original paper (Wald 1945). This has certain advantages. For instance, the ratio
would be defined as

$$ \mathcal{R}_{\rm post} = \frac{P(\mathcal{H}_1 \mid \mathcal{D})}{P(\mathcal{H}_0 \mid \mathcal{D})} = \frac{P(\mathcal{D} \mid \mathcal{H}_1) \, P(\mathcal{H}_1)}{P(\mathcal{D} \mid \mathcal{H}_0) \, P(\mathcal{H}_0)} . \qquad (11) $$

One could choose priors for $P(\mathcal{H}_1)$ and $P(\mathcal{H}_0)$. $A$ and $B$ then become thresholds for "degrees
of belief" that we must hold for one hypothesis over another before we claim one or the
other to be true. For instance, given that $\mathcal{H}_1$ is true, $1 - \beta$ becomes the required confidence
for $P(\mathcal{H}_1 \mid \mathcal{D})$ and $\alpha$ the required confidence for $P(\mathcal{H}_0 \mid \mathcal{D})$ to claim that $\mathcal{H}_1$ is true, i.e.,
$A = (1 - \beta)/\alpha$.



However, as noted by Wald (1945, 1947), the likelihood ratio also has its merits. First,
the likelihood ratio has some precedent. Even those who subscribe to the Bayesian formalism
use marginalized likelihood ratios (i.e., Bayes factors) (Jeffreys 1939; Kass & Raftery
1995); using a likelihood ratio avoids the use of priors $P(\mathcal{H}_0)$ and $P(\mathcal{H}_1)$, which can strongly
influence the result. Further, likelihood ratios allow like-for-like comparisons with likelihood ratios
used in other analyses with fixed $p_0$ and $p_1$. However, the definitions of $A$ and $B$ become
cumbersome even in the circumstance here where we are unconcerned whether or not the
test ever terminates. For instance, given that $\mathcal{H}_1$ is true, $A$ parameterizes how much more
likely the data must come from a universe where $\mathcal{H}_1$ is true as opposed to $\mathcal{H}_0$ before we claim
that $\mathcal{H}_1$ is indeed true.

In short, using a ratio of posteriors allows $\alpha$ and $\beta$ to be conceptualized intuitively
as degrees of belief in one hypothesis or another. Using likelihood ratios is common and,
while one does not have to contend with defining priors for $\mathcal{H}_1$ and $\mathcal{H}_0$, $\alpha$ and $\beta$ can no
longer be conceptualized in terms of degrees of belief for $\mathcal{H}_0$ and $\mathcal{H}_1$. Here, we opt for the
more traditional calculation of the likelihood ratio, or what could be thought of as a ratio of
posteriors if $P(\mathcal{H}_1) = P(\mathcal{H}_0)$.
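
The connection between the two definitions can be written out in one line. The block below is a short worked relation rather than an additional result; it only uses Bayes' theorem and the symbols already introduced.

```latex
\mathcal{R}_{\rm post}
  = \frac{P(\mathcal{D} \mid \mathcal{H}_1)}{P(\mathcal{D} \mid \mathcal{H}_0)}
    \cdot \frac{P(\mathcal{H}_1)}{P(\mathcal{H}_0)}
  = \mathcal{R}_n \, \frac{P(\mathcal{H}_1)}{P(\mathcal{H}_0)} ,
\qquad
\mathcal{R}_{\rm post} \ge A
  \;\Longleftrightarrow\;
\mathcal{R}_n \ge A \, \frac{P(\mathcal{H}_0)}{P(\mathcal{H}_1)} .
```

Choosing unequal priors is therefore equivalent to running the likelihood ratio test with rescaled boundaries, and equal priors leave the boundaries unchanged.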



4. Testing the Method 

4.1. Test Convergence and the Error Rates $\alpha$ and $\beta$

To account for our ignorance of the true correlation probability $p$ of the given data set, $p$
is marginalized in the likelihood ratio. As described in Section 2, we assume
that the signal probability $p$ that we want to test against the null hypothesis is uniformly
distributed on $[p_1, 1]$. With no prior knowledge of the signal other than $p > p_0$, we choose
$p_1 = p_0$.

In practice, this approach has an important consequence if one were to interpret the
results of the hypothesis test in terms of the probabilities $\alpha$ and $\beta$, for example by using
$(1 - \alpha)$ as a confidence level for the rejection of the null hypothesis. Since the numerator
now allows for $p_1 < p < 1$, $\alpha$ and $\beta$ have, strictly speaking, only meaning for a data set that
has similar properties, i.e., has a correlation probability that is not a single value, but spread
over the interval $[p_1, 1]$. However, in reality, any given data set has some fixed probability $p$
to correlate with objects of a catalog.

Therefore, we must test whether in the case of a fixed $p$ the method returns probabilities
for type-1 and type-2 errors lower than $\alpha$ and $\beta$. In general, we expect the type-2 error to be
smaller than $\beta$ if the correlation probability in the data is larger than some minimum value
$p_{\min}$.

A second practical issue is the convergence of the sequential likelihood ratio test to a
conclusion in favor of $\mathcal{H}_0$ or $\mathcal{H}_1$. When $p_1 = p_0$ and the null hypothesis is true ($p = p_0$),
the ratio test will often fail to reach a conclusion even as the number of events $n$ becomes
quite large. This problem can be avoided in two ways. One would be to terminate the test
after accumulating some number of events, $n_0$. The acceptance or rejection of $\mathcal{H}_0$ would
then depend on whether $\mathcal{R}_{n_0}$ was less or greater than 1. However, making a decision in this
way would require a modification of the type-1 and type-2 errors (see Appendix A). Another
would be to choose $p_1 = p_0 + \delta$, where $\delta$ is a positive constant. The particular choice of
$\delta$ is somewhat ad hoc, since it mainly reflects the experimenter's degree of belief about
the strength of the signal. However, for those uncomfortable with this kind of inference,
we present a simple procedure to find $\delta$ such that: the likelihood ratio $\mathcal{R}_n$ converges to a
conclusion while still satisfying a large number of signal hypotheses; and the type-1 and
type-2 rates of the sequential analysis are consistent with the classical interpretations of the
probabilities $\alpha$ and $\beta$.

In this section, we test these expectations with simulated data sets and determine values
for $\delta$ and $p_{\min}$ for some typical values for $p_0$, $\alpha$, and $\beta$. If we find $\delta$ to be small and $p_{\min}$ to be
close to $p_0$, then the test will terminate with type-1 and type-2 error rates that are smaller
than $\alpha$ and $\beta$, giving the result an intuitive interpretation. For each of the following tests,
we produce $10^4$ simulated data sets^1 with a correlation probability $p$ and subject these data
sets to a sequential analysis with predefined values for $\alpha$ and $\beta$.

Case 1: $\mathcal{H}_0$ is True: First consider the case where the null hypothesis is true, so that
the correlation probability $p$ of the data is equal to $p_0$. The dark grey area in Fig. 3 indicates,
as a function of $p_0$, the range $p_1 > p_0$ for which the ratio test terminates with a type-1 error
probability greater than $\alpha$. Note that when $p_1 \approx p_0$, there is a large fraction of data sets in
which the test does not come to a conclusion (rejection or acceptance of the null hypothesis)
even when the number of events $n$ exceeds 1000. The fraction of undecided tests is added to
the type-1 error rate to give a conservative limit on $p_1$. For all $p_1$ that fall above the dark
grey area, the test terminates with a type-1 error rate less than $\alpha$. As expected, the dark
grey range is narrow, so the test is "well-behaved" if $p_1$ is chosen not too close to $p_0$. As
an example, if the random correlation probability is $p_0 = 0.1$, then $p_1 = 0.14$ ($\delta = 0.04$) is
sufficient. Any values for $p_1$ larger than 0.14 will of course also be well-behaved.

^1 We will use $\alpha = \beta = 0.001$, and therefore test the method on $10 \times 1/0.001 = 10^4$ data sets.

Case 2: $\mathcal{H}_0$ is False: We now consider the case where the null hypothesis is false.
Choosing the values for $p_1$ determined with the procedure outlined in "Case 1," we use
simulated data to find the minimum signal probability $p_{\min}$ for which the ratio test terminates
with a type-2 error probability less than $\beta$. The light grey area in Fig. 3 depicts, as a function
of $p_0$, the range of $p_{\min} > p_1$ for which the ratio test terminates with a type-2 probability
greater than $\beta$. For instance, when $p_0 = 0.1$ and $\alpha = \beta = 0.01$, for all signal probabilities
$p > p_{\min} = 0.18$ the ratio test will terminate with a type-2 error probability less than $\beta$.
Note that the $p_{\min}$ values given here are conservative, since they not only require a type-2
error below $\beta$ in case of a signal with strength $p_{\min}$, but also a type-1 rate below $\alpha$ and a
rejection or acceptance of $\mathcal{H}_0$ before the sample size $n$ reaches 1000 when $\mathcal{H}_0$ is true. This
last requirement slightly inflates the value of $p_{\min}$.
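
The procedure of Cases 1 and 2 can be mimicked with a small Monte Carlo sketch that counts rejections, acceptances, and undecided tests for a fixed true correlation probability, and adds the undecided fraction to the error rate as described above. The parameter choices ($p_0 = 0.1$, $p_1 = 0.3$, 500 trials, the seed) and all names are assumptions chosen to keep the example fast; they are not the settings behind Fig. 3.

```python
import random
from math import comb


def marginalized_ratio(k, n, p0, p1):
    """Marginalized likelihood ratio for k correlations in n events."""
    cdf = sum(comb(n + 1, j) * p1**j * (1 - p1) ** (n + 1 - j) for j in range(k + 1))
    return cdf / ((n + 1) * comb(n, k) * p0**k * (1 - p0) ** (n - k) * (1 - p1))


def outcome_fractions(p_true, p0=0.1, p1=0.3, alpha=0.001, beta=0.001,
                      trials=500, n_max=1000, seed=11):
    """Fractions of simulated data sets that reject H0, accept H0, or stay undecided."""
    rng = random.Random(seed)
    A, B = (1 - beta) / alpha, beta / (1 - alpha)
    counts = {"reject": 0, "accept": 0, "undecided": 0}
    for _ in range(trials):
        k, outcome = 0, "undecided"
        for n in range(1, n_max + 1):
            k += rng.random() < p_true
            ratio = marginalized_ratio(k, n, p0, p1)
            if ratio >= A:
                outcome = "reject"
                break
            if ratio <= B:
                outcome = "accept"
                break
        counts[outcome] += 1
    return {name: count / trials for name, count in counts.items()}


null_case = outcome_fractions(p_true=0.1)     # H0 true
signal_case = outcome_fractions(p_true=0.5)   # H0 false
# Conservative error rates: undecided tests are counted as errors.
type1_conservative = null_case["reject"] + null_case["undecided"]
type2_conservative = signal_case["accept"] + signal_case["undecided"]
print(type1_conservative, type2_conservative)
```

Repeating the calculation on a grid of $p_0$, $p_1$ and $p$ values, with more trials, would trace out regions analogous to the shaded areas described in the text.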

The simulations of Cases 1 and 2 indicate that $p$ and $p_1$ must be larger than $p_0$ if the
test is to arrive at a decision in a reasonable amount of time, and if the results are to be
consistent with the error probabilities $\alpha$ and $\beta$. (To a much lesser extent, this second issue
also exists in Wald's original formulation of the ratio test, in which $p_1$ is treated as a single
alternative probability (Wald 1945, 1947).) Even so, the amounts by which $p$ and $p_1$ should
differ from $p_0$ are small enough that they do not appreciably limit the usefulness of the
method when a "classical" interpretation of $\alpha$ and $\beta$ is required. We note that the existence
of small intervals above $p_0$ where such an interpretation is not possible is a typical feature
of sequential tests; see, for example, Wald (1945, 1947) and Lewis & Berry (1994). It should be
stressed, however, that we have not demonstrated a circumstance where we are obtaining
some undesired values for $\alpha$ and $\beta$. Rather, we have demonstrated that marginalizing the
likelihood is not the equivalent of inserting the right value for $p$.



4.2. Efficiency of the Ratio Test 



An important aspect of a sequential test is its length, i.e., the number of events $n$
necessary to reach a decision. Fig. 4 shows an example for the typical length of the test as
a function of the signal probability $p$. In this example, the background probability is chosen
as $p_0 = 0.1$, the lower boundary of the marginalization is $p_1 = 0.3$, and $\alpha = \beta = 0.001$.
For 10^ simulated data sets, Fig. 4 (top) shows the median number of events required for a
termination of the test. The error bars indicate the range that includes 68% of the data
sets. In this example, the median size of a data set required to accept the null hypothesis if
it is true ($p_0 = 0.1$) is 27. The median size of a data set required to reject the null hypothesis
if it is wrong depends on $p$ and is large when $p$ is close to $p_0$. Above $p \approx 0.6$, the median
number reaches a plateau of about 7 events.

Fig. 4 (bottom) shows which decision is actually made, depicting the fraction of data
sets for which the null hypothesis ($p_0 = 0.1$) is accepted and the fraction for which it is
rejected, as a function of the signal probability.

Comparing the length of the test with the marginalized likelihood to Wald's original test
is not straightforward, since the length of each test depends on the specifics of the problem,
and because the probability $p_1$ has quite a different meaning for the two methods. However,
we find that the marginalized test tends to require fewer events when $p_1$ is the same in
both tests. For the above example, the median number of events required by the Wald test
to accept the null hypothesis if it is true is 55 and thus twice as large as for the marginalized
likelihood ratio. For signal probabilities $p > 0.6$, the Wald test reaches a plateau that is roughly
comparable to the marginalized test. Fig. 5 shows the median number of events required for
the Wald test for $p_1 = 0.3$ and $\alpha = \beta = 0.001$.
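
The comparison of test lengths can be reproduced in outline with the sketch below, which estimates the median stopping time of both tests when the null hypothesis is true. The number of trials, the seed, the cap on $n$, and all function names are assumptions; with different seeds the medians will fluctuate by a few events.

```python
import random
from math import comb
from statistics import median


def marginalized_ratio(k, n, p0, p1):
    """Marginalized likelihood ratio for k correlations in n events."""
    cdf = sum(comb(n + 1, j) * p1**j * (1 - p1) ** (n + 1 - j) for j in range(k + 1))
    return cdf / ((n + 1) * comb(n, k) * p0**k * (1 - p0) ** (n - k) * (1 - p1))


def wald_ratio(k, n, p0, p1):
    """Wald's point-hypothesis likelihood ratio, with p1 as the single alternative."""
    return (p1 / p0) ** k * ((1 - p1) / (1 - p0)) ** (n - k)


def stopping_time(ratio, p_true, rng, p0=0.1, p1=0.3, alpha=0.001, beta=0.001,
                  n_max=1000):
    """Number of events until the ratio leaves the interval (B, A)."""
    A, B = (1 - beta) / alpha, beta / (1 - alpha)
    k = 0
    for n in range(1, n_max + 1):
        k += rng.random() < p_true
        r = ratio(k, n, p0, p1)
        if r >= A or r <= B:
            return n
    return n_max


def median_length(ratio, p_true, trials=400, seed=3):
    rng = random.Random(seed)
    return median(stopping_time(ratio, p_true, rng) for _ in range(trials))


median_marginalized = median_length(marginalized_ratio, p_true=0.1)
median_wald = median_length(wald_ratio, p_true=0.1)
print("marginalized:", median_marginalized, " Wald:", median_wald)
```

Scanning `p_true` from $p_0$ upward would give curves that can be compared with the median lengths quoted in the text for both tests.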



5. Summary 



We have outlined a sequential analysis technique for testing a point null hypothesis
with probability $p_0$ against a signal probability $p$. The method is based on the sequential
analysis proposed in Wald (1945, 1947), but replaces the likelihood ratio used to evaluate
the significance of a signal with one that marginalizes the signal strength.

In many sequential tests, the signal strength is unknown when the test starts. Typically,
the signal probability $p$ can in principle have any value in the interval $[p_0, 1]$. Rather than
choosing a fixed threshold for $p$, as suggested in Wald (1945, 1947), we have argued that, in
general, the better alternative is to marginalize $p$ and account for our ignorance exactly. In
the marginalization of the signal likelihood, the integration starts at some value $p_1 = p_0 + \delta$,
where $\delta$ is an ad hoc parameter reflecting the experimenter's belief about the strength of the
signal, the capability of his experiment, and other a priori knowledge.



Because of the integration of the signal likelihood over a range in $p$, the parameters $\alpha$
and $\beta$ have lost their intuitive meaning if the method is applied to data sets where $p$ is fixed,
as is typically the case for real data. However, we have shown that for most values of $\delta$
and $p$ that occur in correlation searches, the type-1 and type-2 error rates of the sequential
analysis are consistent with the classical interpretations of the probabilities $\alpha$ and $\beta$.

Note that we have run a test with one of two outcomes (i.e., an acceptance or rejection
of $\mathcal{H}_0$), defining $\alpha$ and $\beta$, rather than one outcome (say, only a rejection of $\mathcal{H}_0$) such as
in Darling et al. (1968). The latter case supposes that we are only concerned about reporting
a signal. However, it is important to state a null result at some point in the interest of
reducing reporting bias. That is, it is important to ensure that 1% of the results that claim
an excess of events are indeed a 1% effect.

The sequential analysis technique proposed here is efficient, allows the signal significance 
to be evaluated after the test has been fulfilled, adheres to the likelihood principle, and 
rigorously accounts for our ignorance of the signal strength. 



We thank Diego Harari, Antoine Letessier-Selvon, and John A.J. Matthews for valuable 
discussions and help. This work is supported by the National Science Foundation under 
contract numbers NSF-PHY-0500492 and NSF-PHY-0636875. 



A. The Truncated Sequential Analysis Test 

In practice, the test must end. It is supposed that a decision to accept or reject the null
hypothesis must be made when $n = n_0$ if it has not been made already for $n < n_0$. Following
the derivation of the modified errors for truncated tests in Wald (1945), $\alpha(n_0)$ and $\beta(n_0)$
are defined as the probabilities of errors of the first and second kinds if the test is truncated
at $n = n_0$. The objective is then to derive an upper bound on $\alpha(n_0)$ and $\beta(n_0)$ for the case in which (1)
the test ends prematurely and (2) $\mathcal{H}_1$ is accepted if $\mathcal{R}_{n_0} > 1$ and $\mathcal{H}_0$ is accepted if $\mathcal{R}_{n_0} < 1$.
In doing so, we find a suitable $\delta$ and $n_0$ where $\alpha$ and $\beta$ are small.

First, $\rho_0(n_0)$ is defined as the probability that, under the null hypothesis,

1. $B < \mathcal{R}_{n_0 - 1} < A$

2. $1 < \mathcal{R}_{n_0} < A$

3. The sequential analysis would terminate with an acceptance of $\mathcal{H}_0$ if allowed to continue.






For the truncated test, we are rejecting the null hypothesis if $1 < \mathcal{R}_{n_0} < A$. In other words,
$\rho_0(n_0)$ is the probability of wrongly rejecting the null hypothesis when $1 < \mathcal{R}_{n_0} < A$, although
the test would have terminated with the acceptance of the null hypothesis, as wanted, if we had let
it continue. This is added to the probability that the test would terminate wrongly if we let it
continue. Therefore, the upper bound on $\alpha(n_0)$ can be expressed as

$$ \alpha(n_0) \le \alpha + \rho_0(n_0) . \qquad (A1) $$

Now if $\bar{\rho}_0(n_0)$ is simply the probability under the null hypothesis that $1 < \mathcal{R}_{n_0} < A$, then
$\rho_0(n_0) \le \bar{\rho}_0(n_0)$ and therefore

$$ \alpha(n_0) \le \alpha + \bar{\rho}_0(n_0) . \qquad (A2) $$

Similarly, $\rho_1(n_0)$ is defined as the probability that, under the "signal" hypothesis,

1. $B < \mathcal{R}_{n_0 - 1} < A$

2. $B < \mathcal{R}_{n_0} < 1$

3. The sequential analysis would terminate with an acceptance of $\mathcal{H}_1$ if allowed to continue.

and

$$ \beta(n_0) \le \beta + \bar{\rho}_1(n_0) , \qquad (A3) $$

where $\bar{\rho}_1(n_0)$ is defined to be the probability under the signal hypothesis that $B < \mathcal{R}_{n_0} < 1$.

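We then calculate $\bar{\rho}_0(n_0)$ explicitly. The probability of obtaining $\mathcal{R}_{n_0} > 1$ (without
exceeding $A$) if the null hypothesis is true is

$$ \bar{\rho}_0(n_0) = \sum_{k=k_{1+}}^{k_A} \binom{n_0}{k} \, p_0^k (1-p_0)^{n_0-k} , \qquad (A4) $$

where $k_{1+}$ is the minimum integer $k$ for which

$$ \frac{\frac{1}{1-p_0-\delta} \int_{p_0+\delta}^{1} p^k (1-p)^{n_0-k} \, dp}{p_0^k (1-p_0)^{n_0-k}} > 1 \qquad (A5) $$

and $k_A$ is the maximum integer $k$ for which

$$ \frac{\frac{1}{1-p_0-\delta} \int_{p_0+\delta}^{1} p^k (1-p)^{n_0-k} \, dp}{p_0^k (1-p_0)^{n_0-k}} < A . \qquad (A6) $$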
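
Similarly,

$$ \bar{\rho}_1(n_0) = \sum_{k=k_B}^{k_{1-}} \binom{n_0}{k} \, \frac{1}{1-p_0-\delta} \int_{p_0+\delta}^{1} p^k (1-p)^{n_0-k} \, dp , \qquad (A7) $$

where $k_{1-}$ is the maximum integer $k$ for which

$$ \frac{\frac{1}{1-p_0-\delta} \int_{p_0+\delta}^{1} p^k (1-p)^{n_0-k} \, dp}{p_0^k (1-p_0)^{n_0-k}} < 1 \qquad (A8) $$

and $k_B$ is the minimum integer $k$ for which

$$ \frac{\frac{1}{1-p_0-\delta} \int_{p_0+\delta}^{1} p^k (1-p)^{n_0-k} \, dp}{p_0^k (1-p_0)^{n_0-k}} > B . \qquad (A9) $$

Under this scheme, Fig. 6 shows $\bar{\rho}_0(n_0)$ and $\bar{\rho}_1(n_0)$ as a function of $\delta$ and $n_0$. It shows that
a rather large $\delta$ ($\sim 0.7$) is required to bring $\bar{\rho}_0(n_0)$ and $\bar{\rho}_1(n_0)$ to be less than $\alpha = \beta = 0.001$.
Further, if the calculation is extended we find that it would take $\sim 180$ events to bring $\bar{\rho}_0(n_0)$
and $\bar{\rho}_1(n_0)$ to be ~ for any $\delta$.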
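
The two truncation terms can be evaluated directly by summing over the number of correlated events $k$ at $n = n_0$. The sketch below assumes the binomial-sum form of the incomplete beta function for the marginalized integral; the function name, the variable names `rho0_bar` and `rho1_bar`, and the example values ($n_0 = 10$, $p_0 = 0.1$, $\delta = 0.2$) are illustrative assumptions.

```python
from math import comb


def truncation_terms(n0, p0, delta, alpha, beta):
    """Added error terms for a test truncated at n0 events.

    Returns (rho0_bar, rho1_bar): the probability under H0 that 1 < R < A,
    and the probability under the marginalized signal hypothesis that B < R < 1.
    """
    p1 = p0 + delta
    A, B = (1 - beta) / alpha, beta / (1 - alpha)
    rho0_bar = rho1_bar = 0.0
    for k in range(n0 + 1):
        # Probability of k correlations under the null hypothesis.
        null_prob = comb(n0, k) * p0**k * (1 - p0) ** (n0 - k)
        # Probability of k correlations under H1, with p uniform on [p1, 1].
        cdf = sum(comb(n0 + 1, j) * p1**j * (1 - p1) ** (n0 + 1 - j)
                  for j in range(k + 1))
        signal_prob = cdf / ((n0 + 1) * (1 - p1))
        ratio = signal_prob / null_prob      # the binomial coefficient cancels
        if 1 < ratio < A:
            rho0_bar += null_prob
        if B < ratio < 1:
            rho1_bar += signal_prob
    return rho0_bar, rho1_bar


rho0_bar, rho1_bar = truncation_terms(n0=10, p0=0.1, delta=0.2, alpha=0.001, beta=0.001)
print(f"rho0_bar = {rho0_bar:.4f}, rho1_bar = {rho1_bar:.4f}")
```

For this small example both terms are of order several percent, far above $\alpha = \beta = 0.001$, which illustrates why truncating the test early inflates the error bounds.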

Fig. 1. Likelihood ratio as a function of $p_1$ for $n = 10$, $k = 6$, and $p_0 = 0.1$.

Fig. 2. Likelihood ratio as a function of the number of events for a background probability
$p_0 = 0.1$, $p_1 = 0.3$, and a signal probability $p = 0.5$ (top) and $p = 0.1$ (bottom). The ratio is
calculated for 10^ random data sets. The plots show the median (dark grey dots) together
with the range that includes 68% and 95% of the data sets (dark and light grey areas). The
values for the test boundaries $A$ and $B$ for $\alpha = \beta = 0.001$ are indicated as dashed and dotted
lines.

Fig. 3. Range for $p_1 > p_0$ for which the ratio test terminates with type-1 error probabilities
greater than $\alpha$ (dark grey), as a function of $p_0$. Range for $p > p_1$ for which the ratio test
terminates with type-2 error probabilities greater than $\beta$, as a function of $p_0$ (light grey).
The upper plot is for $\alpha = \beta = 0.01$, the lower plot for $\alpha = \beta = 0.001$.

Fig. 4. Top: Median number of events necessary for the sequential test to come to a conclusion,
as a function of the signal probability $p$. In this example, the background probability
is $p_0 = 0.1$, and $p_1 = 0.3$, $\alpha = \beta = 0.001$. Error bars indicate the range that includes 68%
of the simulated data sets. Bottom: For the same simulated data sets, fraction of data sets
for which the null hypothesis is accepted (solid line) and rejected (dotted line) as a function
of the signal probability $p$ for a background probability $p_0 = 0.1$.

Fig. 5. Median number of events necessary for the Wald sequential test to come to a conclusion
(open circles), as a function of the signal probability $p$, compared to the marginalized
likelihood ratio test (filled circles). The fixed point $p_1$ is the same in both cases. For this
example, the background probability is $p_0 = 0.1$, and $p_1 = 0.3$, $\alpha = \beta = 0.001$. Error bars
indicate the range that includes 68% of the simulated data sets.

Fig. 6. The added error for $\alpha$, $\bar{\rho}_0(n_0)$, and for $\beta$, $\bar{\rho}_1(n_0)$, as a function of $\delta$, where $p$ is integrated
from $p_0 + \delta$ to 1, and of the number of events at which the test is truncated, $n_0$.



